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Predictive or treatment selection biomarkers are usually evalu¬ 
ated in a subgroup or regression analysis with focus on the treatment- 
by-marker interaction. Under a potential outcome framework (Huang, 

Gilbert and Janes [Biometrics 68 (2012) 687-696]), a predictive 
biomarker is considered a predictor for a desirable treatment benefit 
(defined by comparing potential outcomes for different treatments) 
and evaluated using familiar concepts in prediction and classifica¬ 
tion. However, the desired treatment benefit is unobservable because 
each patient can receive only one treatment in a typical study. Huang 
et al. overcome this problem by assuming monotonicity of potential 
outcomes, with one treatment dominating the other in all patients. 

Motivated by an HIV example that appears to violate the monotonic¬ 
ity assumption, we propose a different approach based on covariates 
and random effects for evaluating predictive biomarkers under the 
potential outcome framework. Under the proposed approach, the pa¬ 
rameters of interest can be identified by assuming conditional in¬ 
dependence of potential outcomes given observed covariates, and a 
sensitivity analysis can be performed by incorporating an unobserved 
random effect that accounts for any residual dependence. Application 
of this approach to the motivating example shows that baseline viral 
load and CD4 cell count are both useful as predictive biomarkers for 
choosing antiretroviral drugs for treatment-naive patients. 

1. Introduction. Much of contemporary medical research is focused on 
treatment effect heterogeneity, that is, the fact that the same treatment 
can have different effects on different patients. The increasing awareness of 
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Fig. 1. Nonparametric regression analysis of the THRIVE data: smoothed estimates 
(thicker lines) and pointwise 95% confidence intervals (thinner lines) for treatment-specific 
response rates as functions of baseline viral load (left) and CDf cell count (right). 


treatment effect heterogeneity has motivated the development of predictive 
biomarkers for identifying the subpopulation of patients who would actually 
benefit from a new treatment [e.g., Simon (2008, 2010)]. Classical examples 
of predictive biomarkers include genetic markers for cancer treatment, such 
as the OncoType Dx multi-gene score for breast cancer [Paik et al. (2004)] 
and the K-RAS gene expression level for colorectal cancer [Karapetis et al. 
(2008)]. This article is motivated by a new and growing interest in the possi¬ 
bility of using baseline viral load or CD4 cell count as a predictive biomarker 
for treating human immunodeficiency virus type 1 (HIV-1). 

Evaluation of a predictive biomarker is usually based on a subgroup or 
regression analysis comparing treatment effects on different subpopulations 
defined by the biomarker [e.g., Gail and Simon (1985), Russek-Cohen and 
Simon (1998), Pocock et al. (2002)]. Under this approach, the performance 
of a predictive biomarker is measured by the interaction between marker 
value and treatment assignment in a regression model for the clinical out¬ 
come of interest, which will be referred to as an outcome model. For exam¬ 
ple, consider the THRIVE study, a phase 3, randomized, noninferiority trial 
comparing rilpivirine, a newly developed nonnucleoside reverse transcriptase 
inhibitor, with efavirenz in treatment-naive adults infected with HIV-1 [Co¬ 
hen et al. (2011)]. The outcome of primary interest is a binary indicator of 
virologic response at week 48 of treatment (see Section 4 for details). Our 
consideration of baseline viral load and CD4 cell count as predictive biomark¬ 
ers (for choosing between efavirenz and rilpivirine) is motivated by Fig¬ 
ure 1, which shows nonparametric estimates of treatment-specific response 
rates as functions of marker value, separately for each biomarker. Figure 1 
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suggests a qualitative interaction between treatment and each biomarker, 
with rilpivirine favored over efavirenz for higher values of baseline CD4 cell 
count and lower values of baseline viral load. Because the statistical signifi¬ 
cance in Figure 1 is not straightforward to assess, a simple logistic regression 
analysis that includes treatment, marker and their interaction is performed 
(separately for each marker), and the resulting p -value for the treatment- 
by-marker interaction is 0.059 for viral load and 0.031 for CD4 cell count. 

While informative about possible interactions, Figure 1 is less transparent 
about the predictive performance of these biomarkers and their comparison. 
Huang, Gilbert and Janes (2012) point out that a strong interaction is not 
sufficient for adequate performance of a predictive biomarker, that the scale 
of the interaction coefficient depends on the functional form of the outcome 
model, and that the interaction-based approach is ill-suited for develop¬ 
ing combination markers. These authors also propose a potential outcome 
framework where a predictive biomarker is—as the term suggests—treated 
as a predictor for a desirable treatment benefit. Note that a treatment ben¬ 
efit is necessarily the result of comparing potential outcomes for different 
treatments applied to the same patient. Under this perspective, predictive 
biomarkers should be evaluated using familiar measures in prediction and 
classification [e.g., Zhou, Obuchowski and McClish (2002), Pepe (2003), Zou 
et al. (2011)]. Specifically, one should consider the true and false positive 
rates of a binary marker and the receiver operating characteristic (ROC) 
curve for a continuous marker. This approach allows different markers to be 
compared on the same scale and facilitates the development of combination 
markers. 

The objective of this article is to evaluate and compare baseline viral load 
and CD4 cell count as predictive biomarkers under the potential outcome 
framework. It is important to distinguish this objective from the related 
problems of identifying potential markers, combining several markers into 
a hybrid marker, choosing the cutoff point for a given marker and, more 
generally, developing an individualized treatment strategy. Variable selec¬ 
tion techniques such as lasso-based methods have been used to select and 
combine genetic markers [e.g., Tian et al. (2012)]. Nonparametric multivari¬ 
ate methods [e.g., Su et al. (2008), Foster, Taylor and Ruberg (2011), Qian 
and Murphy (2011)] and semiparametric methods [e.g., Zhang et al. (2012)] 
have been used to develop a treatment rule (i.e., a set of criteria for selecting 
patients), which can be considered a binary hybrid marker obtained by com¬ 
bining multiple markers. A natural question that arises from the THRIVE 
study is whether or how to combine baseline viral load and CD4 cell count 
into a hybrid marker with improved predictive accuracy. While that ques¬ 
tion is beyond the scope of this article, we note that the potential outcome 
framework and the proposed methods are applicable to any given marker. 
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Once a hybrid marker is developed, it can be evaluated and compared to 
the individual markers in the same framework using the same methods. 

An analytical challenge for the potential outcome framework is that the 
desired treatment benefit, which involves potential outcomes under different 
treatments, is usually unobservable because each patient can receive only 
one treatment in a typical study. A possible exception to this limitation is 
a cross-over study, which has its own issues [e.g., Poulson, Gadbury and 
Allison (2012)] and will not be discussed in this article. In a typical clinical 
study such as the THRIVE study, the standard methodology in prediction 
and classification is not directly applicable to a predictive biomarker. To 
address this issue and the resulting identification problem, Huang, Gilbert 
and Janes (2012) make a monotonicity assumption, namely, that one treat¬ 
ment dominates the other in all individual patients, and suggest a sensitivity 
analysis for possible departures from the monotonicity assumption. While 
the monotonicity assumption may be plausible in some situations such as 
vaccine trials, it may be less appealing as a starting point in other situations. 
In the THRIVE study, for example, the presence of a qualitative interaction 
(in the sense that neither treatment has a higher response rate for all realis¬ 
tic marker values) implies that neither treatment is dominant in all patients. 
Additionally, the approach of Huang, Gilbert and Janes (2012) is developed 
for a binary outcome and not readily extensible to other types of outcomes. 

In this article, we propose alternative methods that do not require mono¬ 
tonicity or assume a binary outcome. Our first step is to account for the 
dependence between potential outcomes (for different treatments applied to 
the same patient) by adjusting for relevant covariates, such as demographic 
variables and baseline characteristics. If the set of measured covariates is 
sufficient for explaining the dependence between potential outcomes, the 
aforementioned performance measures can then be identified by assuming 
conditional independence of potential outcomes given covariates. Possible 
violations of this assumption can be addressed by introducing a random ef¬ 
fect to account for residual dependence. In the next section we formulate 
the problem in terms of potential outcomes and provide a general ratio¬ 
nale for the proposed approach. The proposed methods are then described 
in Section 3 and applied to the THRIVE study in Section 4. The article 
ends with a discussion in Section 5. Some technical details are provided in 
a supplemental article [Zhang et al. (2014)]. 

2. Notation and rationale. Suppose a randomized clinical trial is con¬ 
ducted to compare an experimental treatment with a control treatment, 
which may be a placebo or a standard treatment, with respect to a clinical 
outcome of interest, which may be discrete or continuous. Because there is 
only one outcome of primary interest in the THRIVE study, we will work 
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with a scalar outcome unless otherwise noted. However, most of our method¬ 
ology is readily applicable to a vector-valued outcome, with the exception 
of the sensitivity analysis in Section 3.2 (described fully in Web Appendix 
B). For a generic patient in the target population, let Y(t) denote the po¬ 
tential outcome that will result if the patient receives treatment t (0 for 
control; 1 for experimental). Note that the Y(t), t = 0,1, cannot both be 
observed at a given time. Let T denote the treatment assigned randomly 
to a study subject, thus T is a Bernoulli variable independent of all base¬ 
line variables. Without considering noncompliance, which is negligible in 
the THRIVE study, we assume that T is also the actual treatment given to 
the subject, and write Y = Y ( T ) for the actual outcome. Should noncom¬ 
pliance become a major issue, we could take an intent-to-treat perspective 
and compare treatment assignments or use analytical techniques to recover 
the actual treatment effect [e.g., van der Laan and Robins (2003)]. We as¬ 
sume that large values of Y are desirable. Where necessary, the subscript 
i = 1 ,... ,n will be attached to random variables to denote individual pa¬ 
tients in the trial. 

Our interest is in evaluating a predictive biomarker Z, a baseline variable 
which may be binary or continuous, with higher marker values support¬ 
ing the use of the new treatment. The biomarker Z is intended to identify 
the subpopulation of patients who would benefit from the new treatment 
relative to the control. It can be a continuous variable as in our moti¬ 
vating example or a binary one such as a treatment rule developed us¬ 
ing nonparametric multivariate methods. Let the desired treatment bene¬ 
fit be indicated by B = /{(V(0), V(l)) € B}, where /{•} is the indicator 
function and B is the set of desirable outcomes. Note that B is by defi¬ 
nition a comparison of the two potential outcomes. For a binary outcome, 
B might be an indicator for V(0) < V(l) or V(0) < Y( 1), depending on 
which treatment is preferable with identical (efficacy) outcomes. In our ex¬ 
ample, we set B = /{T(0) < V(l)} because rilpivirine is thought to have a 
better safety profile and will likely be preferred over efavirenz when their 
efficacy outcomes are identical. For a continuous outcome, we might take 
B = {(yo, yi ): yi — 2/o > <5}, where 5 reflects considerations of cost, clinical sig¬ 
nificance and possibly the safety profiles of the two treatments (if not incor¬ 
porated into a vector-valued outcome). For an ordered categorical outcome, 
the definition of B may be more complicated. We shall take the definition of 
B as given and focus on the evaluation of Z for predicting B. The target B 
is an intrinsic characteristic of an individual patient, which suggests that Z 
can be evaluated using well-known quantities in prediction and classification 
[e.g., Zhou, Obuchowski and McClish (2002), Pepe (2003), Zou et al. (2011)]. 
For a binary marker, it makes sense to consider the true and false positive 
rates, defined as TPR = P(Z = 1|L> = 1) and FPR = P (Z = 1 |B = 0), respec¬ 
tively. For a continuous marker, it is customary to consider the ROC curve 
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defined as 


ROC(s) = 1 - F zlB=1 {F~' B=0 (l - s)}. 


Here and in the sequel, we use F to denote a generic (conditional) distri¬ 
bution function, with the subscript indicating the random variable(s) con¬ 
cerned. The ROC curve is simply a plot of TPR versus FPR for classifiers 
of the form I(Z > z), with the threshold 2 ranging over all possible values. 

Because B is never observed, the existing methodology for evaluating pre¬ 
dictors, which generally assumes that B can be observed, cannot be used 
directly to evaluate a predictive biomarker. Nonetheless, we note that TPR, 
FPR and ROC are all determined by F Z \ B , which can be recovered using 
Bayes’ rule from the marginal distribution of Z and the conditional prob¬ 
ability nz(z) = P (B = 1| Z = z). For a binary marker, 1 — 7r^(0) and 7T^(1) 
are negative and positive predictive values, respectively, 


(1) 


TPR = 


_ TTT Z {1) _ 

TTT Z { 1 ) + (1 - t)tT Z (0) ’ 


FPR = 


t{1 


t{1 - 7r z (l)} _ 

7T Z (1)} + (1 - T ){1 - 7T Z (0)} ’ 


where r = P(Z = 1). For a continuous marker, we have 


(2) 


/ zo . roo 

■K Z (z)dF z (z)f / ir z (z)dF z (z), 


' —00 

r z 0 


Fz\b=o 


/ z 0 roo 

{1-7 T Z (z)}dF z (z) / / {1-7 T Z (z)}dF Z (z). 

-00 ' J—00 


Since Z is fully observed, the identifiability of F z \ B would follow from that of 
7 tz(z). Once an estimate of irz(z) is available, it can be substituted into the 
above displays together with an empirical estimate of t or Fz, depending 
on the nature of Z. 

Despite its simple appearance, ttz(z) is not straightforward to estimate. 
In fact, for any conceivable form of B, the quantity 7 tz(z) = P{(T(0), Y (1)) £ 
B\Z = z} is not empirically identifiable because it involves the joint distribu¬ 
tion of T(0) and T(l) given Z = z. Owing to randomization, it is straight¬ 
forward to identify 


F t \z(y\z) = P{Y(t) < y\Z = z} = P(T <y\T = t,Z = z) 

for each t £ {0,1}, and to estimate it from a regression analysis for Y given 
T and Z. However, the dependence structure of Y (0) and Y (1) given Z = z 
is not identifiable from the data [e.g., Gadbury and Iyer (2000)], which is 
also known as the fundamental problem of causal inference [Holland (1986)]. 
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Because tt z (z ) is not determined by the “marginals” F t \ z (t = 0,1), its identi¬ 
fication and estimation require additional information or assumptions about 
the dependence between Y(0) and Y( 1) given Z = z. For a binary outcome, 
this can be achieved by assuming nronotonicity [i.e., Y(0) < Y( 1) with prob¬ 
ability 1] as in Huang, Gilbert and Janes (2012). The nronotonicity assump¬ 
tion corresponds to maximal positive dependence of Y(0) and Y(l). 

For a general outcome and without assuming monotonicity, we develop 
alternative methods by adapting the techniques of Dodd and Pepe (2003) 
and Zhang et al. (2013). To account for the dependence of Y(0) and Y(l), 
we start by conditioning on relevant covariates that are associated with both 
outcomes. Let X denote a vector of such covariates measured at baseline, 
which may include prognostic factors and effect modifiers. In the THRIVE 
study, X may include gender, race, and baseline age and body mass index. 
We include Z as a component of X and write X = (Z, W), where W con¬ 
sists of the additional baseline covariates. Writing 7Tx(x) = P{(Y(0), Y(l)) £ 
£>|X = x}, a conditioning argument yields 

(3) ttz(z) =Y{tt x 0Q\Z = z} = J ir x (z,w)F w iz(dw\z). 

Because Fyy\ z is empirically identifiable and estimable, the challenge now is 
to identify and estimate 7Tx(x). 

If X is sufficient for explaining the dependence between Y(0) and Y(l), 
then we can expect that 

(4) Y(0)_LY(1)|X, 

that is, that Y(0) and Y(l) are conditionally independent given X. This 
assumption cannot be verified with the observed data and must be based on 
external information. Under assumption (4), the joint distribution of Y(0) 
and Y( 1) given X is determined by the “marginals” 

F t\x(y\*) = P {Y(t) < y\X = x} = P(Y < y\T = t, X = x) (t = 0,1), 

which are straightforward to identify and estimate. In reality, assumption (4) 
can be violated because X may not explain all the dependence between Y (0) 
and Y(l). Such violations can be examined in a sensitivity analysis based 
on a latent variable that accounts for any residual dependence between Y (0) 
and Y(l). Under this approach, assumption (4) is relaxed as follows: 

(5) Y(0) -L Y(1)|(X, U), 

where U is a subject-specific latent variable that is independent of X. In 
other words, U represents what is missing from X that makes assumption 
(4) break down. Assumption (5) alone is not sufficient to identify 7rjv(x) 
because U is unobserved. However, by specifying certain quantities related 
to U, we can perform a sensitivity analysis based on assumption (5), as we 
demonstrate in the next section. 


ZHANG, NIE, SOON AND LIU 


3. Methodology. We now describe methods for estimating the aforemen¬ 
tioned performance measures (TPR, FPR and ROC). As indicated earlier, 
we will start by estimating ttx (x) under assumptions (4) and (5). This can 
be done using a direct approach and an indirect approach, to be described 
in Sections 3.1 and 3.2, respectively. The direct and indirect approaches are 
based on models for ttx(x) and F Y \T,x(y |i,x), respectively, which we refer 
to as benefit and outcome models. A benefit model is directly informative 
about 7Tx (x) and thus more interpretable in the present context, while an 
outcome model is more familiar to practitioners and easier to estimate and 
validate using standard techniques. Further comments comparing the two 
approaches are given at the end of Section 3.2 after the approaches are 
described and in Section 5. In Section 3.3 we show how to convert an esti¬ 
mate of 7Tx (x) into one of nz(z). Estimates of the performance measures of 
interest are given in Section 3.4. 


3.1. Direct estimation of vrx(x) based on a Benefit model. A benefit 
model is a parametric model for P (B = 1|X) = P{(T(0), Y (1)) € £>|X}, such 
as the following generalized linear model (GLM): 

(6) 7rx(X; a) = + ar^X), 


where a = ( ai,ot' x )' is the regression parameter and if is an inverse link 
function. Since B is binary, the probit and logit links are natural choices. 

Suppose the conditional independence assumption (4) holds. To gain some 
intuition, consider a discrete X taking values in {xi,... ,xjf}. Within each 
stratum defined by X = x*., assumption (4) implies that E(0) and Y( 1) are 
independent of each other, as if they arise from different subjects, which we 
assume are independent. In other words, given that X, = X ; -, the natural 
pair (F)(0), F)(l)) is identically distributed as the artificial pair (F)(0), Y)(l)). 
If Ti = 0 and Tj = 1, then (F)(0), F)(l)) is observable as (F),Fj), so that 
Tv(xfc) = E(/?|X = Xfc) can be estimated by 


( 7 ) 


1 

noknik 


X X B ‘i- 

ieSofc 


where B^ = /{(F), Yj) e B}, S t k = {i:Ti=t, X* = x^}, and n t k denotes the 
size of S t ,k (t = 0, 1 ; k = 1 ,..., K ). 

The question is how to generalize this idea to a nondiscrete X. To follow 
the logic of (7), one would need to find subjects from different treatment 
groups with the same value of X, which becomes difficult when X has contin¬ 
uous components (4 in our example). To overcome that problem, we borrow 
ideas from Dodd and Pepe (2003), who consider semiparametric regression 
for the area under the ROC curve, and work with an expanded model given 
by 

(8) P {Bij = l|x,„ X,; (3) = ififi 1 + tfr(Xi + X,)/2 + ^(X,- - X,)}, 
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where (3 = ((3i, /3' x , /3' d xY> * G <Sok and j G <Sifc. The new features of model 

(8) relative to model ( 6 ) are introduced for the sole purpose of estimating 
a. Our research question does not pertain to the left-hand side of ( 8 ) or the 
regression coefficient f3 d x- However, assumption (4) implies that P(£>,j = 
l|Xj = Xj = x) = P(H = 1|X = x). Thus, when X* = Xj, model ( 8 ) reduces 
to model ( 6 ) with 

( 9 ) a = (Pi,f3' x y ■ 


In that sense, model ( 8 ) is a helper model that allows us to estimate model 
( 6 ) with the observed data. Let St = {i : T) = t} and let nt denote the size of 
St (t = 0,1). Then the regression parameter (3 in model ( 8 ) can be estimated 
by solving the equation 


( 10 ) 


»=EE 

ie<S 0 jeSi 


ditij Bij Hjj 

d(3 TTijil-TTij)' 


where 


71 ij = 4>{Pi + P'x&i + Xj)/2 + fc(Xj - X,)}. 

As suggested by Dodd and Pepe (2003), equation (10) need not include 
all possible pairs (i,j); it could be based on a subset of pairs such that 
||Xj — Xj|| < e, where || • || denotes Euclidean norm, for some e > 0. The 
choice of e represents a bias-variance trade-off, where a larger e leads to 
better efficiency and stability and also more sensitivity to the last component 
of model ( 8 ). 

The approach just described relies heavily on the conditional indepen¬ 
dence assumption (4), which relates model ( 8 ) to model ( 6 ) through equation 
(9). Equation (9) does not hold when assumption (4) is violated. However, 
under alternative assumptions, we have 

(11) a = 7 (Pi, fix)' 

for a scalar 7 . The key assumptions for (11) include assumption (5) and a 
GLM-like structure analogous to model ( 6 ): 

( 12 ) P (B = 1 |X, U ; a*) = ^a\ + ^X + c$U), 

where a* = In Section A of the supplemental article [Zhang 

et al. (2014)], we state additional assumptions that lead to (11) and give an 
expression for 7. Since (3 is identifiable and estimable using the techniques 
described earlier, a can be estimated as soon as 7 is known or estimated. 
Unfortunately, 7 is unidentifiable from the observed data. For the probit and 
logit links, we show in Section A of the supplemental article [Zhang et al. 
(2014)] that 7 can take any value greater than 2 -1 / 2 ~ 0.71. Thus, when 
assumption (4) is in doubt, we can perform a sensitivity analysis based on 
specified values of 7 G ( 2 ^ 1 / 2 ,oo), with 7=1 corresponding to conditional 
independence. 
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3.2. Indirect estimation of ir x ( x) based on an outcome model. An out¬ 
come model is a parametric model, say, F Y\T,x(y |i,x;0), for the conditional 
distribution of Y given T and X, specified up to a finite-dimensional pa¬ 
rameter 9. Let /y|T,x(2/|i) x ; 0) denote the associated probability density or 
mass function. A typical outcome model would be a GLM with the following 
mean structure: 

(13) E(Y\T, X; 9) = ^ + 9 t T + 9' x X + 9' tx (TX)}, 

where 9 = Qx^tx)' an d is an inverse link function. The parameter 

9 can be estimated by maximizing the likelihood n2=i 9), 

and the resulting maximum likelihood estimate (MLE) will be denoted by 
9. Because of randomization, the outcome model implies 

that 


Ft\x(y\x) = P{^ (t) < y |x = x} = F Y \ T ,x(v\t , x ; o) (t = o, 1), 

which can be estimated by substituting 9 for 9. 

Under the conditional independence assumption (4), the joint distribution 

F -\x(yo,yi\x) = p {^(0) <yo,Y(l) <j/i|X = x} 

is identified as 


F o\X (?/o l x )Pi|x (l/i l x ) = f y\t,x ( 2 / 0 1 x ; 9')Fy\T,x ( 2 / 1 11, x; 0) 
and estimated by replacing 9 with 9. Write 

P)fi(2/o,yi| x ) = i 7 V|T,x(yo|O,x;0)Fy| T)X (yi|l,x;0), 

where the superscript Cl stands for conditional independence. The corre¬ 
sponding estimate of ^t x (x) is then given by 

JI{(yo,yi) G B}F^(dy 0 ,d yi \x) =: F™(B\x). 

When assumption (4) is in doubt, we can perform a sensitivity analysis 
based on assumption (5), which implies that 


(14) 


F -\x(yo,yi\*) 


J F -\x,u(.yo,yi\x,u)Fu(du) 


J F 0 ]XtU (yo\x,u)F 1 \ XtU (y 1 \x,u)F u (du), 


where we generalize the previous notation in an obvious way (with U as an 
additional conditioning variable). This suggests that we specify a model, say, 
Fy\T,x,u{y\t> x > u 'i 0*)j for the conditional distribution of Y given (T,X,U), 
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with a finite-dimensional parameter 6*. Analogous to the GLM (13), we 
work with a generalized linear mixed model (GLMM) with 

(15) E(Y\T, X, U; 6*) = ij}{0{ + 6* t T + 0^X + 0f' x (TX) + 6* V U}, 

where 6* = 0^, @*x, 9 txi ^u)'■ The GLMM is not completely identified 

without additional information, and we propose a sensitivity analysis based 
on specified values of 6y (or, rather, \0y\), which is described in Section B 
of the supplemental article [Zhang et al. (2014)]. 

It is worth mentioning that the random effect U has a different interpre¬ 
tation here than in Section 3.1. In model (15), U represents an unobserved 
prognostic factor which affects both potential outcomes in the same direc¬ 
tion; a change in U may or may not have much effect on the treatment 
benefit B, depending on the precise definition of B and model (15). Al¬ 
though one could incorporate a random treatment effect into model (15), 
the resulting method will likely become very complicated. In model (12), U 
acts like an effect modifier in that a change in U leads directly to a change 
in the probability of a desirable treatment benefit. (Here we use the term 
effect modifier in a heuristic sense which may or may not agree with an 
interaction-based definition.) Thus, aside from modeling assumptions, the 
direct and indirect approaches also differ in how they deal with departures 
from assumption (4). The indirect approach is designed to address viola¬ 
tions of assumption (4) due to an unmeasured prognostic factor, whereas 
the direct approach is more appropriate for violations of assumption (4) due 
to an unmeasured effect modifier. 

3.3. Estimation of ttz(z). Equations (1) and (2) suggest that evaluation 
of a predictive biomarker Z can be based on ttz(z) = P(B = l\Z = z) and the 
marginal distribution of Z . Because the latter is straightforward to estimate, 
this section is focused on estimation of -kz {z) with a given estimate of ttx (x) , 
say, 7r y (x) , which may be obtained using any one of the proposed methods. 
For a binary marker, equation (3) suggests that ttz(z) can be estimated by 
Y17=i^{Zi = z:}7Tx(z, Wj)/ J2?=i I{Zi = z}. We therefore assume that Z is 
continuous in the rest of this section. 

We propose to estimate 7rz(z ) by substituting an estimate of F w \ z into 
equation (3). One could specify a parametric model for F w \ z , however, this 
can be difficult because the dimension of W can be rather large (5 in our 
example). We therefore exploit the fact that Z is only one-dimensional and 
employ nonparametric regression techniques in estimating F w \ z - Let £ :M —>■ 
[0, oo) be a kernel function and A > 0 a bandwidth parameter. Then we can 
estimate F w \ z {w\z) by 

ig{(^-^)/A}/(w t <w) 
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This, together with 7?x(x), can be substituted into equation (3) to estimate 
ttz(z) as 

nz(z) = -EEFIuTFA}-' 

An important question here is how to choose the bandwidth A, for which 
we propose a cross-validation approach. The estimate ttz(z) can be regarded 
as a nonparametric regression of ttx(Z, W) on Z, and its predictive accuracy 
can be assessed by comparing the “response” Bi = TixO^i) with the estimate 
Bi = nz(Zi). We propose to partition the sample into a training set and a 
validation set and choose a value of A that minimizes the average of (Bi — 
Bi) 2 in the validation set with nz(z) estimated from the training set using 
bandwith A. 


3.4. Estimation of TPR, FPR and ROC. Given ttz(z) from Section 3.3, 
the parameters of interest can be estimated using equations (1) and (2). For 
a binary marker, this leads to 


TPR = 


_ T7Tz(1) _ 

TKZ(I) + (1 -t)7Tz(0)’ 


FPR = 


r{l -TT Z (1)} 


r{ 1 - vfz(l)} + (1 - f){l - vfz(O)} : 


where r = n 1 Z%- For a continuous marker, we have 


ROC(s) = 1 - F z \ b= 1 {F~{ b=0 (1 - s)}, 


where 


F Z \B=l(zo) = S T j I(Z i < Z 0 )TTz(Zi) t /^TTziZi), 
i =1 i =1 


n n 

F zlB=0 (z 0 ) = Y, I(Zi < Zo){l - 7T z(Zi)} / - 7 fz(Zi)}. 

i =1 i=1 

An asymptotic analysis of these estimates can be rather tedious, espe¬ 
cially because ttz(z) involves smoothing and cross-validation. For inference, 
we recommend the use of bootstrap confidence intervals. To account for 
all variability in the estimates, the entire estimation procedure, including 
bandwidth selection based on cross-validation, should be repeated for each 
bootstrap sample. 


4. Application to the THRIVE study. We now apply the methods of 
Section 3 to the THRIVE study introduced in Section 1, a randomized, 
double-blind, double-dummy, noninferiority trial at 98 hospitals or medical 
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centers in 21 countries [Cohen et al. (2011)]. The THRIVE study compared 
rilpivirine with efavirenz for treating HIV-1 in treatment-naive adults, in the 
presence of common background nucleoside or nucleotide reverse transcrip¬ 
tase inhibitors (N[t]RTIs). The study enrolled 680 adult patients who were 
naive to antiretroviral therapy, with a screening viral load of at least 5000 
copies per ml and viral sensitivity to N(t)RTIs. The patients were random¬ 
ized in a 1:1 ratio to receive oral rilpivirine 25 mg once daily or efavirenz 600 
mg once daily, in addition to an investigator-selected regimen of background 
N(t)RTIs. 

The outcome of interest to us is virologic response (viral load below 50 
copies/ml) at week 48 of treatment, with patient discontinuation (about 
5%) counted as failure. The observed virologic response rates are 86 % and 
82% in the rilpivirine and efavirenz groups, respectively, and the difference 
between the two groups (3.5%; 95% Cl: —1.7-8. 8 %) meets a prespecified 
noninferiority criterion based on a 12% margin (p < 0.0001). Thus, rilpivirine 
appears comparable to, if not better than, efavirenz in terms of population- 
level efficacy. However, Figure 1 suggests that individual patients respond 
differently to the two therapies and that baseline viral load and CD4 cell 
count could be used as predictive biomarkers. As indicated earlier, we will 
for safety reasons define individual-level treatment benefit as B = 7{Y(0) < 
Y(l)}, where Y(0) and Y(l) denote potential outcomes for efavirenz and 
rilpivirine, respectively. 

Baseline viral load and CD4 cell count are both log-transformed before 
entering the benefit and outcome models as covariates. In addition to these 
biomarkers, the covariate vector X also includes gender, race (black, white 
or other), age and body mass index at baseline. For the direct approach of 
Section 3.1, the benefit model is a logistic regression model given by ( 6 ), 
with the aforementioned covariates as linear terms (and no interactions), 
and the helper model is given by ( 8 ) with the same link. For the indirect 
approach of Section 3.2, the outcome model is a logistic regression model 
similar to (13) except that interactions of X with T are limited to the two 
biomarkers. The selection of covariates and interactions in these models is 
based on subject matter knowledge and not on statistical tests. Estimation 
of model ( 8 ) is based on the 1% of pairs (of control and experimental pa¬ 
tients) that are most similar in terms of ||X* — Xy 11, as suggested by Dodd 
and Pepe (2003). (Other proportions, from 0.001 to 1, have been attempted 
without yielding a material difference.) Under the indirect approach, the 
simplified method in Web Appendix B is used to estimate 6* for a given 
By. In any case, estimates of 7 Tx(x) are converted into estimates of ir z(z) 
using the kernel smoothing method of Section 3.3, with a Gaussian kernel 
and a cross-validated bandwidth. The cross-validation is based on a 1:1 
random partition of the sample into a training set and a validation set, and 
the bandwidth is chosen among {2 l sd (Z) : l = —5,..., 5} using the minimum 
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mean squared error criterion. The formulas of Section 3.4 are used to ob¬ 
tain estimates of ROC curves, and the trapezoidal rule is then employed 
in calculating the associated AUCs. The above procedure is performed for 
both biomarkers on the original sample as well as 200 bootstrap samples. 
Pointwise 90% confidence intervals for ROC curves are obtained using a 
simple bootstrap percentile method, and bootstrap standard errors are used 
for inference on AUCs (and AUC differences between the two biomarkers). 

Figure 2 gives a side-by-side comparison of ROC curves for the two 
biomarkers, estimated using the direct approach of Section 3.1 with 7 = 
0.71,1,2,4. The value 7 = 1 corresponds to assumption (4) of conditional 
independence, while the value 0.71 is a theoretical lower bound. The asso¬ 
ciated AUC estimates and standard errors are shown in the upper half of 
Table 1. Both Figure 2 and the relevant portion of Table 1 show that the two 
biomarkers are useful as predictive biomarkers, with ROC curves above the 
diagonal and AUCs greater than 0.5 after accounting for sampling variabil¬ 
ity. The performance of each biomarker does appear to depend heavily on the 
value of 7; the AUC estimate increases dramatically with increasing 7. This 
pattern is confirmed by additional analyses based on other values of 7 (re¬ 
sults not shown). Given the remarks at the end of Section 3.2, these results 
suggest that evaluation of a predictive biomarker can be rather sensitive to 
an unmeasured effect modifier. On the other hand, for each value of 7, the 
AUC estimate for baseline viral load is higher than that for baseline CD4 cell 
count, although the difference is not statistically significant. These results 
suggest that comparison of predictive biomarkers may be less sensitive to 
the choice of 7, even though this particular data set does not provide strong 
evidence that baseline viral load is better than baseline CD4 cell count as 
a predictive biomarker. Whether we are evaluating a single marker or com¬ 
paring two markers, there is an obvious relationship between increasing 7 
and greater variability in the estimates, as indicated by widening confidence 
intervals in Figure 2 and increasing standard errors in Table 1. 

Figure 3 gives another comparison of the two biomarkers based on ROC 
curves estimated using the indirect approach of Section 3.2 with 6 ^ = 0,1.8, 
4,8. Here, the value 6^ = 0 corresponds to conditional independence, and 
the value 1.8 is a lower confidence bound (to be discussed later). The as¬ 
sociated AUC estimates and standard errors are shown in the lower half of 
Table 1. These results are consistent with those from the direct approach 
in suggesting that both biomarkers are useful as predictive biomarkers. In 
particular, the results for conditional independence (d'jj = 0) are fairly con¬ 
sistent with their counterparts under the direct approach (with 7 = 1). Like 
Figure 2, Figure 3 also shows an increasing trend for the predictive perfor¬ 
mance of each biomarker as a function of the sensitivity parameter 8 The 
trend is confirmed for additional values of 9^, although the results are not 
shown. Thus, considering the remarks at the end of Section 3.2, it appears 
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Fig. 2. ROC analysis of the THRIVE data using the direct approach: estimated ROC 
curves (solid lines) and 90% pointwise confidence intervals (dashed lines) for baseline viral 
load (RNA, left) and CD) cell count (right), with different values of 7. 


that evaluation of a predictive biomarker can also be sensitive to an unmea¬ 
sured prognostic factor. The results from the indirect approach reinforce the 
previous observation that baseline viral load appears to perform better than 
baseline CD4 cell count, although the differences here also fail to reach sta¬ 
tistical significance. As is the case with the direct approach, increasing 0y 
tends to produce greater variability in the estimates. 
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Table 1 

AUC analysis of the THRIVE data: point estimates and bootstrap standard errors for the 
AUCs of baseline viral load (RNA) and CD4 cell count as well as their difference 
(RNA-CDf), obtained using the direct approach of Section 3.1 and the indirect approach 
of Section 3.2 with sensitivity parameters 7 and 9^, respectively 


Sensitivity 

parameter 

Point estimate 

Standard error 

RNA 

CD4 

Diff. 

RNA 

CD4 

Diff. 

7 



Direct approach 



0.71 

0.61 

0.58 

0.03 

0.04 

0.03 

0.03 

1 

0.65 

0.63 

0.03 

0.05 

0.04 

0.05 

2 

0.77 

0.75 

0.02 

0.06 

0.08 

0.08 

4 

0.88 

0.80 

0.08 

0.07 

0.10 

0.13 

e*u 



Indirect approach 



0 

0.64 

0.59 

0.05 

0.04 

0.04 

0.05 

1.8 

0.65 

0.60 

0.06 

0.05 

0.05 

0.05 

4 

0.68 

0.64 

0.04 

0.07 

0.07 

0.08 

8 

0.72 

0.70 

0.02 

0.09 

0.09 

0.11 


Although the uncertainty in comparing the two biomarkers could poten¬ 
tially be reduced by an increased sample size, some uncertainty will likely 
remain in evaluating each individual marker, given the apparent dependence 
on sensitivity parameters. Nonetheless, the increasing trend observed under 
both approaches suggests that a lower bound for predictive performance may 
be available under each approach. For the direct approach, the lower bound 
is given by 7 = 0.71, as noted in Section 3.1. For the indirect approach, a 
lower bound for 6y (and hence for the performance of each biomarker) may 
be available from longitudinal data (see Web Appendix B). For the THRIVE 
study, the lower bound for 9y is estimated as 2.5 (95% Cl: 1.8-3.7, based 
on 1000 bootstrap samples) from a GLMM analysis of repeated measure¬ 
ments at 24, 32, 40 and 48 weeks. Although earlier measurements (baseline 
through 20 weeks) are also available, we restrict our analysis to the later mea¬ 
surements in order to reduce misspecification bias; see Zhang et al. [(2013), 
Section 4] for a detailed explanation of this strategy. This GLMM analysis 
suggests that, under the additional assumptions given in Web Appendix B, 
the value 9y = 1.8 represents the worst case scenario for the indirect ap¬ 
proach. The corresponding ROC and AUC estimates are better than those 
for 7 = 0.71 under the direct approach and thus more informative as lower 
bounds. 

Our ROC analyses under both (direct and indirect) approaches also illus¬ 
trate that measures of predictive accuracy need not correlate with interac¬ 
tions. Although baseline CD4 cell count exhibits a more dramatic interaction 
in Figure 1, there is no indication (in the same data set) that it outperforms 
baseline viral load as a predictive biomarker. 
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FPR 


FPR 


Fig. 3. ROC analysis of the THRIVE data using the indirect approach: estimated ROC 
curves (solid lines) and 90% pointwise confidence intervals (dashed lines) for baseline viral 
load (RNA, left) and CD4 cell count (right), with different values of 9(j. 


5. Discussion. In this article we have proposed new methods for evalu¬ 
ating predictive biomarkers in the potential outcome framework of Huang, 
Gilbert and Janes (2012). Instead of monotonicity, our starting point is 
conditional independence of potential outcomes given observed covariates. 
Possible departures from the latter assumption can be addressed by incor¬ 
porating a random effect that accounts for any residual dependence between 
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potential outcomes. Because the random effect models are not completely 
identifiable, we propose a sensitivity analysis approach based on quantities 
related to the random effect. Our analysis of the THRIVE data reveals a 
great deal of sensitivity for the performance of each individual biomarker and 
much less sensitivity for the comparison of the two markers. Despite the un¬ 
certainty about individual biomarkers, the lower bounds on their predictive 
performance, available under both (direct and indirect) approaches, show 
that they are both useful as predictive biomarkers. For comparing the two 
markers, our analysis does not show much sensitivity and does not indicate 
a significant difference in the predictive performance of the two markers. 

There does appear to be a lot of sensitivity (particularly in the upper 
portion) in quantifying the performance of an individual marker. Such sen¬ 
sitivity introduces additional uncertainty into the overall conclusion of the 
analysis, which is certainly undesirable. We see this as a reminder of the 
inherent limitation of the observed data for answering certain questions. As 
in many other statistical problems (e.g., missing data, censoring, confound¬ 
ing), the parameters of interest to us are not identifiable from the observed 
data without making untestable assumptions. When such assumptions are 
in doubt, the parameters are partially identified and the associated inference 
cannot be as sharp as that for point-identified parameters [Manski (2003)] . It 
may be disappointing to see that different assumptions (about the sensitiv¬ 
ity parameter) can lead to a wide range for the parameter of interest. On the 
other hand, one could argue that the sensitivity analysis serves its purpose 
well by revealing the limitation of the available data and information. 

In theory, point identification is achieved under the conditional inde¬ 
pendence assumption (6), which requires that the observed covariates be 
sufficient for explaining the dependence between the potential outcomes. 
Although we may never be certain that assumption (6) holds, it seems rea¬ 
sonable to believe that the assumption will get close to being true with a 
growing set of relevant covariates obtained from increasing knowledge of 
the disease. The more we know about the disease, the more information we 
have about relevant covariates, the more confidence we should be able to 
place in assumption (6). In practice, it may be difficult to determine when 
we have sufficient information to rely on assumption (6) and when we have 
to perform a sensitivity analysis. One possible solution would be a Bayesian 
approach with an informative prior on the sensitivity parameter which quan¬ 
tifies our uncertainty about assumption (6). As we become more confident 
about assumption (6), the prior will become more concentrated at or near 
that assumption. 

Each of the direct and indirect approaches has unique advantages. The 
direct approach is able to accommodate complex definitions of treatment 
benefit involving several outcome variables of arbitrary types, as long as 
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they are all observed. The indirect approach is best suited for a single out¬ 
come of primary interest. Although the outcome model can include multiple 
outcomes in principle, their dependence structure can be difficult to specify 
and estimate. Even for a single outcome, the indirect approach requires a 
greater amount of modeling (in the sense that an outcome model implies 
a benefit model) and is therefore more prone to misspecification bias. On 
the other hand, the indirect approach is able to use all the information in 
the observed outcome data and therefore may have an efficiency advantage. 
Finally, for sensitivity analysis, the direct approach is more appropriate for 
an unmeasured effect modifier, and the indirect approach for an unmeasured 
prognostic factor. In practice, we recommend that both approaches be used 
in a complementary fashion, as in our analysis of the THRIVE data. 

A reviewer has pointed out that some elements of personal judgment may 
be involved in choosing among different treatments. While this is not a ma¬ 
jor issue in the THRIVE data, where the definition of a treatment benefit is 
quite objective, it can certainly become a major issue in other therapeutic 
areas such as weight loss. For example, some patients may be willing to ac¬ 
cept the extra risks of a surgical procedure (relative to a nonsurgical one) for 
an additional loss of 20 pounds, while others may not. To incorporate such 
personal judgment into the proposed approach, we could allow 6 to vary 
across patients, and we would need to be able to measure 5 for individual 
patients or at least be able to predict 6 using measurable individual char¬ 
acteristics. In the latter case, the methodology will need to be modified to 
incorporate a prediction model for 5 and the associated variability. It will be 
of interest to explore that possibility in the context of a suitable application. 

Because of the complexity of the proposed methodology, a sample size 
formula is not yet available; however, for a given application, one could 
perform a simulation study to gauge the adequacy (in terms of power and 
precision) of a proposed sample size. Such an assessment should obviously 
consider the objective of the analysis (e.g., evaluating a single biomarker 
versus comparing two or more biomarkers). In addition, the nonparametric 
regression in Section 3.3 may imply a higher requirement on the sample size 
than do the other parts of the methodology, which are based on parametric 
regression techniques. A sample size that is inadequate for one-dimensional 
nonparametric regression may compromise the performance of the methods. 
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SUPPLEMENTARY MATERIAL 

Supplement: Technical details (DOI: 10.1214/14-AOAS773SUPP; .pdf). 
We provide technical details concerning the sensitivity analyses in Sec¬ 
tions 3.1 and 3.2. 
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